Data Preparation and Analysis

This chapter focuses on the tasks that the user undertakes as part of the estimation process.

Overview

A series of data preparation tasks is discussed in the following sections. Most of the tasks only require data files to be created in a relatively mechanistic manner, but two require the user to make considered choices. These are discussed in Screenlines and in Setting confidence levels.

The final sections in this chapter explain the estimation stage in terms of tasks facing the user. As CUBE Analyst usually requires minimal input from the user, apart from the supply of prepared data files, the estimation stage is very straightforward. However, advice is given on possible ways of improving the speed of estimation. This may be achieved through:

•Influencing the strategy used to calculate the Hessian matrix, which is used in the optimization stages of CUBE Analyst—see Tuning estimation performance

•Avoiding unnecessary detail in the routing files, which can be burdensome for the data processing elements of CUBE Analyst—see Control of routing information

The final set of activities for the user is to analyze the results to assess the quality of the estimation, partly to determine whether and how they might need to be improved. This topic is discussed in Analyzing the results.

The ideas introduced in this chapter are subsequently illustrated in later chapters with an example application of CUBE Analyst, based on an actual study. Further details on points covered in this section are provided in the standardized estimation procedures.

Matrices

CUBE is used to set or modify individual cells or ranges of cells. This also permits confidence levels to be easily set to global or individual values. For example, you can use a prior matrix (Table 101) to give information about basic trip patterns.

Prior matrix (Table 101)

|                                                          |
|  TABLE = 101   (Prior         )                          |
|          1     2    3    4    5    6    7    8    9   10 |
|          ------------------------------------------------+
|  1:      1     1    0    5   45  126   50   21   30   55 |
|  2:      1     5    0   70  125   36   38   50   58   14 |
|  3:      1     1    0    2  108  119   90   69  148   44 |
|  4:     69     3    0    1    6    7    6    3   25    3 |
|  5:    100     1    0  192   71   20   12   11   14    7 |
|  6:     36     2    0   88   52    6    3    7   16   13 |
|  7:     62     3    0   32   36   58    9   63    9   61 |
|  8:      0     1    0   64   65   30  119   19  121   64 |
|  9:      0     7    0   57  123   70  178  279    7   38 |
| 10:      0    10    0    7   31    3    1   10   21    3 |
| 11:      0    13    0   19   35    4   96  170   28   29 |
| 12:      0     5    0   41  286   52  103  117   29   56 |
| 13:      0     9    0   24   99   50   90   91   23   12 |
| 14:      4     3   14   20   56   19   67   58   21    7 |
| 15:     28     2   36    1  185    1    1    2   15    1 |
+----------------------------------------------------------+

You can use an associated confidence matrix (Table 102) to discriminate between data reliability for different groups of movements.

Confidence levels (Table 102)

|                                                             |
|  TABLE = 102 (Confidences   )                               |
|              1    2    3    4    5    6    7    8    9   10 |
+-------------------------------------------------------------+
|      1:     20   20   20   20   40   40   20   20   20   20 |
|      2:     20   20   20   20   40   40   20   20   20   20 |
|      3:     20   20   20   20   40   40   20   20   20   20 |
|      4:     20   20   20   20   40   40   20   20   20   20 |
|      5:     40   40   40   40   40   40   40   40   40   40 |
+-------------------------------------------------------------+
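
The pattern shown in Table 102 amounts to a global confidence value with higher values applied to selected ranges of cells. The following Python sketch illustrates that idea only. It is not CUBE script, and the function name build_confidences and its arguments are hypothetical names introduced for this example.

```python
def build_confidences(n_rows, n_cols, default, overrides):
    """Return a confidence matrix filled with `default`, then apply overrides.

    Each override is ((first_row, last_row), (first_col, last_col), value),
    using 1-based, inclusive zone numbers.
    """
    conf = [[default] * n_cols for _ in range(n_rows)]
    for (r0, r1), (c0, c1), value in overrides:
        for r in range(r0 - 1, r1):
            for c in range(c0 - 1, c1):
                conf[r][c] = value
    return conf


# Reproduce the five rows displayed in Table 102.
table_102 = build_confidences(
    n_rows=5,
    n_cols=10,
    default=20,
    overrides=[
        ((1, 5), (5, 6), 40),   # destination zones 5-6, all displayed rows
        ((5, 5), (1, 10), 40),  # origin zone 5, all displayed columns
    ],
)

for row in table_102:
    print(" ".join(f"{value:4d}" for value in row))
```

Later overrides take precedence over earlier ones, so the global value should be set first and the more specific ranges afterwards.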

Intrazonals can be included in the matrix. Because routings cover only inter-zonal trips, intrazonals are not affected by the screenline counts; they affect only the trip ends. As their role is so limited, there is a case for omitting intrazonals from the estimation. If intrazonals are included in the trip ends, they should also be included in the matrix. If the trip ends do not include intrazonals, the intrazonal cells of the input matrices should be zero.

Trip ends

Trip ends may be determined by reference to an existing matrix, from surveys (for example, of parking), or by calculation from equations.
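
When trip ends are taken from an existing matrix, they are simply its row and column totals, and the treatment of intrazonals must match the matrix that is input. The following Python sketch shows both cases under stated assumptions: the 3-zone matrix is hypothetical (it is not taken from Table 101), and the function names are introduced here for illustration only.

```python
def zero_intrazonals(matrix):
    """Return a copy of a square matrix with the intrazonal (diagonal) cells set to zero."""
    return [[0 if i == j else cell for j, cell in enumerate(row)]
            for i, row in enumerate(matrix)]


def trip_ends(matrix):
    """Return (generations, attractions) as the row and column totals of a matrix."""
    generations = [sum(row) for row in matrix]
    attractions = [sum(column) for column in zip(*matrix)]
    return generations, attractions


# Hypothetical 3-zone prior matrix.
prior = [
    [5, 10, 20],
    [8, 4, 12],
    [15, 6, 3],
]

with_intrazonals = trip_ends(prior)
without_intrazonals = trip_ends(zero_intrazonals(prior))

print(with_intrazonals)     # totals including the diagonal cells
print(without_intrazonals)  # totals with the diagonal cells zeroed
```

Whichever version of the trip ends is used, the matrix supplied alongside it should be the matching one, so that the two inputs do not contradict each other.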

Networks and traffic and passenger counts

CUBE is used for preparing networks. Traffic and passenger counts, together with confidence level information, are input into the volume field storage areas associated with each link.

Screenlines

Screenlines are used to minimize the effects of assignment errors. Screenlines are defined as the set of count sites which intercept traffic/passenger flows between sets of zones which share the same general corridors of movement (across which the screenlines are suitably located).

The extent of a screenline is determined by the number of alternative (reasonable) paths which are available. In many public transport networks where services are sparse, or in rural highway networks, there may only be a single reasonable route between one general area and another. In this case, screenlines may correspond to single links (although they are still treated as screenlines in the context of CUBE Analyst). In general, however, a screenline will represent a set of links.

In the case of highways, a useful type of screenline is provided by a river or a railway line that has only a few crossing points. In this case all traffic must be routed through known points, and so the assignment error associated with the screenline will be minimized. For CUBE Analyst, there is no difference between a group of traffic counts on separate links (that form a logical screenline) and a single link count amalgamating the flows on separate traffic lanes.

There will normally be few, if any, screenlines that entirely bisect a study area and so intercept all trips between the two sides. CUBE Analyst therefore employs the concept of partial screenlines. They are partial in the sense that they do not extend between the boundaries of a study area, but they intercept all trips between, at least, certain defined pairs of zones.

The method for defining such partial screenlines is manual, and based partly on judgement and the availability of count data sites.

The routing information, together with user-defined screenlines, is used to define the set of O-D pairs whose routes they intercept. The aim is to group count sites into screenlines that balance the objectives to:

1.Maximize the number of O-D pairs that have all routes passing through a screenline.

2.Minimize the number of O-D pairs per screenline, as this maximizes the information value of the counts for the corresponding matrix cells.
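
The two objectives can be checked for a candidate screenline once routing information is available. The following Python sketch is illustrative only: the data layout (route probabilities paired with sets of link names), the link names and the function names are all assumptions made for this example, and CUBE Analyst derives the equivalent intercept information itself.

```python
def intercepted_share(routes, screenline_links):
    """Sum the probabilities of the routes that cross any link of the screenline."""
    return sum(probability for probability, links in routes
               if links & screenline_links)


def classify_od_pairs(routings, screenline_links, tol=1e-9):
    """Split O-D pairs into those fully and those only partly intercepted."""
    full, partial = [], []
    for od, routes in sorted(routings.items()):
        share = intercepted_share(routes, screenline_links)
        if share >= 1.0 - tol:
            full.append(od)      # every route passes through the screenline
        elif share > 0.0:
            partial.append(od)   # some routes bypass the screenline
    return full, partial


# Hypothetical routings: (route probability, set of links on the route).
routings = {
    (1, 2): [(0.6, {"L1", "L3"}), (0.4, {"L2", "L4"})],
    (1, 3): [(1.0, {"L1", "L5"})],
    (2, 3): [(0.7, {"L6"}), (0.3, {"L2", "L7"})],
}

wide = classify_od_pairs(routings, {"L1", "L2"})
narrow = classify_od_pairs(routings, {"L1"})
print(wide)
print(narrow)
```

In this example the wider screenline captures all routes for two O-D pairs, while the single-link screenline fully intercepts only one. Comparing candidates in this way is one means of weighing objective 1 against objective 2.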

The following figure shows an example of screenlines for an example urban area.

Features that these screenline locations demonstrate are shown in the following table.

Screenline location	Function
Northern	Screenline over a single link (for example, a bridge) intercepts all traffic to and from the North.
Western	Parallel, alternative routes from the West require a single screenline intercepting both routes for this corridor
Southern Ring Road	Non-radial traffic is intercepted by (two) screenlines on orbital road
Eastern	Similar parallel routes for long distance traffic to Western side, but parallel routes for local traffic require additional, shorter screenline. Note use of count location in more than one screenline.
Central Area	Detailed movements in centre intercepted with several short screenlines.

Routings

Matrix estimation requires information about which routes are used to connect each pair of origin and destination zones, and the probability that each route is used. Ideally this would come from survey information, but this is onerous and not very practical, so the method uses modeling instead. This routing information is one of the outputs from the assignment process. For TRIPS users it is stored in the route choice probability (RCP) file. For CUBE Voyager Highway users, it is stored in the CUBE Voyager path file. For CUBE Voyager Public Transport users, it is stored in route files.

This section discusses two types of routings:

Highways

The main requirement for CUBE Analyst is for the routings to reflect all reasonable alternative paths whilst avoiding spreading out too much so that they become unrealistic.

For CUBE Voyager users, the paths reflected in the Intercept file derive from combining the all-or-nothing paths from each assignment iteration into one set. This can be done directly in the HIGHWAY program. Alternatively, HIGHWAY can be used to generate a path file, and the appropriate path sets and volumes selected from it for use in CUBE Analyst.
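
The idea of combining the all-or-nothing paths from each assignment iteration into one set can be sketched as follows. This is a minimal Python illustration and not the HIGHWAY program's method: the iteration weights, the data layout and the function name combine_iteration_paths are assumptions made for this example.

```python
from collections import defaultdict


def combine_iteration_paths(iteration_paths, iteration_weights):
    """Merge one path per O-D pair per iteration into route shares per O-D pair.

    iteration_paths: one dict per iteration, mapping (origin, destination)
                     to the sequence of links on that iteration's path.
    iteration_weights: hypothetical weight given to each iteration.
    """
    combined = defaultdict(lambda: defaultdict(float))
    for paths, weight in zip(iteration_paths, iteration_weights):
        for od, path in paths.items():
            combined[od][tuple(path)] += weight

    shares = {}
    for od, path_weights in combined.items():
        total = sum(path_weights.values())
        shares[od] = {path: w / total for path, w in path_weights.items()}
    return shares


# Three hypothetical iterations for a single O-D pair.
iteration_paths = [
    {(1, 2): ("a", "b")},
    {(1, 2): ("c", "d")},
    {(1, 2): ("a", "b")},
]
iteration_weights = [0.5, 0.25, 0.25]

shares = combine_iteration_paths(iteration_paths, iteration_weights)
print(shares[(1, 2)])
```

Identical paths found in different iterations are merged, so the combined set contains each distinct path once with its accumulated share.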

TRIPS users could use a similar approach, or apply one of the stochastic methods. Where congestion is a factor in the network, the assignment itself relies on the trip matrix that the estimation is trying to provide. Hence it may be preferable to use routes derived by methods that calculate multiple routes between zones on a stochastic (statistical) basis, rather than to rely on the paths from a capacity-restrained assignment.

TRIPS supports two such methods, known after their originators as Burrell and Dial. Both can be used successfully with CUBE Analyst, but Burrell can have limitations in large networks where routes traverse large numbers of links. Burrell works by randomizing link costs. By the central limit theorem of statistics, the more links there are on an average route, the higher the chance that routes have the same cost from one set of randomized link costs to the next. As a consequence, it becomes more difficult to generate varied routes. (It can be noted, in passing, that the length of routes in terms of distance is not a problem for the implementation of Burrell used in TRIPS.) The Dial method is not subject to this effect for routes with many links, so it is the advised approach.

Where estimation is being used to update a matrix that is not expected to have changed very much (for instance, because it was obtained from a relatively recent survey), the RCP file from an existing converged capacity restraint assignment may be used in preference to Dial. The choice is a matter of judgement on the relative accuracy of the RCP information.
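
The effect of the number of links on randomized route costs can be illustrated with a simple calculation. The Python sketch below rests on simplifying assumptions that are not part of the Burrell implementation in TRIPS: every link has the same mean cost, and each link cost is perturbed independently with the same coefficient of variation. Under those assumptions the relative spread of the total route cost shrinks with the square root of the number of links.

```python
import math


def route_cost_cv(n_links, link_cv):
    """Coefficient of variation of a route's total cost.

    Assumes n_links independent link costs with equal means, each with
    coefficient of variation link_cv (a simplification for illustration).
    """
    if n_links < 1:
        raise ValueError("a route needs at least one link")
    return link_cv / math.sqrt(n_links)


for n_links in (1, 4, 25, 100):
    print(n_links, round(route_cost_cv(n_links, 0.2), 3))
```

With a 20% spread on each link, a route of 100 links varies by only about 2% in total cost. Competing routes therefore rarely change rank between randomizations, which is why varied routes become harder to generate.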

Public transport

CUBE Voyager PT writes route files by user class. Many controls affect the routing, but a factors file provides a means to determine the extent of multirouting. TRIPS automatically produces multiroute paths and can also store them in an RCP file. The determination of which links are used to connect pairs of origin-destination zones is a function of a path building algorithm which generates a set of reasonable paths. These are based on considerations of generalized cost, which reflect users’ data about transit times, fares, boarding and transfer penalties, and so on. A submode split model can be used to reflect passenger biases when deciding if different modes (bus, metro, rail, etc.) are candidates for inclusion in the set of reasonable paths.

Setting confidence levels

Mathematically, confidence levels have the dual facets of being sampling rates and weighting factors. Confidence levels are entered as percentages but, from both points of view, values of greater than 100 are legitimate.

This section discusses:

Characteristics of the data

The ability of a confidence level to help match an estimated data item (trip end, screenline flow, matrix cell) to its corresponding observed value is influenced by:

1. Data consistency

If data is consistent and free of errors, then the confidence levels will have no influence as they, essentially, help to mediate between different estimates implied by different data items. Conversely, more discrepancies within the data increase the importance of confidence levels.

2. Data quantity

As all data is present in the objective function (see Maximum likelihood objective function), the quantity of data is influential, besides the confidence levels. This means that, for example, relatively large confidence levels applied to the prior matrix, which has many data elements, will tend to restrict the scope of a few count sites to influence the estimated matrix to a significant degree. Of course, this may be the desired effect in some circumstances.

An improved match with any data item can always be achieved with an arbitrarily large confidence level, but it will normally be necessary for users to check the appropriateness of confidence levels that are input.

Deciding on confidence values

A practical approach to setting confidence levels is often to establish a dataset as a reference benchmark, and then set the confidence levels of other data relative to this. For example, if a program of automatic counting means that traffic counts are well and recently observed, then these may be given a high confidence level, say 100, and confidences for other data set relative to that value.

Note that an implied range of 1 - 100 (or of that order of magnitude) has been found to be suitable for many studies. Large applications (say, of 500 zones or more) will tend to encounter a greater range of absolute data values, which can imply the need for a wider range of confidence levels (see the discussion above). The need for this is suitably assessed by means of sensitivity analysis on the confidence levels.

Some general observations applying to confidence levels for different categories of data are given below, in descending order of magnitude of confidence levels for most applications:

•At least some count sites should have observations made over several days (weeks, etc.) to determine basic levels of variability associated with single observations.

•Count confidences should be set with respect to the time period applying to the estimated matrix (for example, a series of counts made on Tuesdays is only a partial observation if the matrix is to correspond to an average working day).

•In the case of highways, trip end confidences are unlikely to exceed count confidences, and will usually be less due to observational difficulties; in the case of public transport, the two sets of confidences are more likely to be similar.

•Even when trip ends have been determined simply from the row and column totals of the prior matrix, the aggregation of the data means that the trip end confidences will be higher than the corresponding individual cell confidences. For this reason, trip end data should always be used when a prior matrix is input.

•Prior matrix cells are, individually, unlikely to have high confidences even when collected by recent, good surveys because there are so many elements of the matrix. This becomes truer as the number of study area zones increases (due to the difficulty of observing all possible movements adequately).

•Cost matrix data may be obtained reasonably reliably, but the relevant confidence concerns the use of this data for trip estimation and this normally only offers an approximation.
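
The variability of single observations at a count site, mentioned in the first observation above, can be summarized from repeated counts. The following Python sketch uses hypothetical daily counts and only the standard statistics module. How the resulting variability is translated into a confidence level remains a matter of judgement and is not prescribed here.

```python
import statistics


def count_variability(daily_counts):
    """Return the mean, sample standard deviation and coefficient of variation."""
    mean = statistics.mean(daily_counts)
    stdev = statistics.stdev(daily_counts)  # sample standard deviation
    return mean, stdev, stdev / mean


# Hypothetical counts at one site on five separate days.
counts = [980, 1020, 1000, 1040, 960]

mean, stdev, cv = count_variability(counts)
print(f"mean={mean}, stdev={stdev:.1f}, cv={cv:.3f}")
```

A site with a small coefficient of variation gives some support for a relatively high confidence on a single-day count, while a large value suggests that one day's observation is a weaker guide to the average.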

Tuning estimation performance

In general CUBE Analyst should be run with default parameter settings. In the majority of cases this will lead to a converged solution, within a reasonable number of iterations.

In some cases an excessive number of iterations may be required or CUBE Analyst may be unable to find a converged solution. In the latter case CUBE Analyst will report that it has halted optimization for a reason such as No further progress possible—linear search failed, rather than the successful message Convergence detected. Such a message is usually caused by excessively inconsistent data being input to CUBE Analyst which pulls the optimizer in opposite directions to the extent that no solution can be found.

To correct this, the user is normally required to check the input data. However, CUBE Analyst does provide an extra control in the form of the parameter ITERH. This sets how often, in iterations, the Hessian matrix is recalculated (see Optimizer: Finding the minimum value); the Hessian directs the optimizer towards the solution. Although the calculation is time-consuming, it will result in the optimizer converging in significantly fewer iterations. For unconverged problems, recalculating the Hessian may provide the direction that the optimizer needs to find a solution. For example, if a problem was halted after 58 iterations, try setting ITERH=50 to see if a new Hessian will allow the optimizer to converge.

In most cases, recalculation of the Hessian matrix will result in longer run times. In particular, time will be wasted if ITERH is set to low values such as 40 or less. CUBE Analyst will itself determine a suitable value for ITERH. Users are advised to set ITERH only when attempting to resolve convergence problems (which are encountered only exceptionally).
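
The trade-off between the number of iterations and the number of Hessian calculations can be seen in a toy problem. The Python sketch below is not CUBE Analyst's optimizer: it minimizes the one-variable function f(x) = exp(x) - 2x with a Newton-type step, and the argument iterh is a hypothetical stand-in that refreshes the second derivative every iterh iterations.

```python
import math


def minimize(x0, iterh, tol=1e-10, max_iter=10_000):
    """Minimize f(x) = exp(x) - 2x, refreshing f''(x) every `iterh` iterations.

    Returns (x, iterations used, number of second-derivative evaluations).
    """
    x = x0
    hessian = math.exp(x)              # f''(x) at the starting point
    hessian_evaluations = 1
    for iteration in range(max_iter):
        gradient = math.exp(x) - 2.0   # f'(x)
        if abs(gradient) < tol:
            return x, iteration, hessian_evaluations
        if iteration > 0 and iteration % iterh == 0:
            hessian = math.exp(x)      # refresh the curvature information
            hessian_evaluations += 1
        x -= gradient / hessian
    return x, max_iter, hessian_evaluations


x_every, steps_every, evals_every = minimize(3.0, iterh=1)
x_never, steps_never, evals_never = minimize(3.0, iterh=10_000)

print("refresh every iteration:", steps_every, "steps,", evals_every, "evaluations")
print("never refresh:          ", steps_never, "steps,", evals_never, "evaluation")
```

Both runs reach the same minimum at x = ln 2. Refreshing at every iteration needs far fewer steps but many more second-derivative evaluations, which in a real estimation are the expensive part.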

Control of routing information

For many estimation runs, producing the O-D intercepts for screenlines and/or part-trip data takes as much time as the estimation itself, or even more. CUBE Analyst needs only the reasonable paths, so controlling the routing to avoid producing routes used by only a small proportion of trips is an important part of achieving practical run times. This is particularly the case for public transport, which can often supply a huge variety of routes. For large models, producing the intercepts could then take an excessive time, as much as an order of magnitude longer than with appropriate parameter settings. Too many routes can also result in file sizes becoming too large for practical use.
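
The effect of excluding routes used by only a small proportion of trips can be sketched as follows. This Python example is an illustration under stated assumptions: in practice the routing is controlled through the assignment parameters rather than by post-processing, and the 5% threshold, the renormalization step and the function name prune_routes are all hypothetical.

```python
def prune_routes(routes, min_share=0.05):
    """Drop routes whose probability is below min_share and renormalize the rest.

    routes: list of (probability, links) for one O-D pair.
    """
    kept = [(p, links) for p, links in routes if p >= min_share]
    if not kept:
        # Always keep at least the most-used route.
        kept = [max(routes, key=lambda route: route[0])]
    total = sum(p for p, _ in kept)
    return [(p / total, links) for p, links in kept]


# Hypothetical route set for one O-D pair.
routes = [
    (0.70, ("a", "b")),
    (0.27, ("c",)),
    (0.02, ("d", "e")),
    (0.01, ("f",)),
]

pruned = prune_routes(routes)
print(pruned)
```

Here two of the four routes carry 97% of the trips, so halving the number of routes to be processed discards very little routing information.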

Routing information can be supplied to CUBE Analyst in the form of a TRIPS RCP file or a CUBE Voyager path file. CUBE Voyager can also supply an intercept file via the Highway and PT programs. If an intercept file is not input, then before starting the estimation proper, CUBE Analyst analyzes the routes through screenlines and/or part-trip links to produce the intercepts, which it saves in an ICP file. This intercept file can be input back into subsequent estimations as long as the links of the screenlines and/or part-trip data are not modified. This is achieved by setting option INTCPT=T or WARMST=T as appropriate and will result in a considerable time saving.

Analyzing the results

CUBE Analyst produces its results as a set of tabulations for printing or viewing, and as a set of files which may be subject to further analysis. One of these files is the estimated matrix itself.

The tabulations in CUBE Analyst’s printout are ordered as follows, after the standard header information:

1.Summary of input data characteristics, showing:

•The data types that were used in the estimation
•Average confidence levels, and their ranges
•The number of data elements for each type of data

This information indicates the relative weighting of data in the estimation process, which is important to know when assessing the results.

2.A summary of the values of key indicators from the last five iterations before the optimization halted. The indicators, and their values, are the same as CUBE Analyst outputs to the screen during the course of its calculations. They are:

•Iteration number
•Step size
•Value of the objective function
•Estimated matrix total number of trips

The reason for halting is also shown, which will normally be Convergence detected.

This information is mainly provided for confirmation that the estimation calculations operated in an appropriate manner (for example, that the objective function value never increased). These two elements of CUBE Analyst’s printout are shown in Results of estimation—including part trip data (and in an abbreviated form in Confidence and convergence summary).

3.The remainder of CUBE Analyst’s tabulations are concerned with comparisons between the user’s input data and the corresponding values derived from the estimated matrix.

Comparative information is output, when applicable, for:

•Trip matrix totals
•Part-trip data
•Total trip generations from zones
•Total trip attractions to zones
•Screenline flow counts

The general pattern of this comparative information from CUBE Analyst is shown in Results of estimation—including part trip data (Trip end comparison of prior (observed) and estimated values and Screenline comparison of prior (observed) and estimated values contain this information in a slightly altered format).

Results of estimation—including part trip data and Results of estimation—including part trip data illustrate the case for CUBE Analyst including part-trip data. Hierarchic estimation output conforms to this same basic pattern, but extra information is provided, as explained in Chapter 7, Hierarchic Estimation and illustrated in Figures 8.12a - 8.12d.

As a rule, the user will be looking for good correspondences between input data and estimated results. However, it is important to note that a poor comparison between input and estimated information is not, by itself, a sign of a poor quality estimation. The reason is (or should be) that a data item with a higher confidence level is dominating the estimation with respect to data which is also relevant, but which has a lower confidence level.

The approach to analyzing CUBE Analyst’s comparative results is, therefore, to identify data which has not been matched well in the estimation and to determine which other data might be causing the discrepancy. Often this is straightforward, for example, a screenline flow count with a markedly different value from trip end values for adjacent zones. If the discrepancy seems unwarranted then this may be a cause to review either the data values themselves, or their confidence levels. (One cause of discrepancies which may not be immediately apparent is poor routing information, for example, on account of inappropriate generalized cost parameters.)
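
Two of the checks described in this section can be applied to figures copied from the printout. The following Python sketch is illustrative only: the objective function values, data items and 10% threshold are hypothetical, and the function names are introduced for this example rather than taken from CUBE Analyst.

```python
def objective_never_increased(objective_values):
    """True if each objective value is no greater than the one before it."""
    return all(later <= earlier
               for earlier, later in zip(objective_values, objective_values[1:]))


def flag_discrepancies(items, threshold_pct=10.0):
    """List poorly matched data items, highest confidence first.

    items: dicts with 'name', 'observed', 'estimated' and 'confidence'.
    Returns tuples of (name, percentage difference, confidence).
    """
    flagged = []
    for item in items:
        observed, estimated = item["observed"], item["estimated"]
        if observed == 0:
            continue  # percentage difference is undefined
        diff_pct = 100.0 * (estimated - observed) / observed
        if abs(diff_pct) > threshold_pct:
            flagged.append((item["name"], round(diff_pct, 1), item["confidence"]))
    flagged.sort(key=lambda entry: entry[2], reverse=True)
    return flagged


# Hypothetical values from the last five iterations and the comparison tables.
objective_history = [1250.4, 1180.2, 1175.9, 1175.1, 1175.1]
comparison = [
    {"name": "Northern screenline", "observed": 1200, "estimated": 1185, "confidence": 100},
    {"name": "Zone 4 generations", "observed": 400, "estimated": 520, "confidence": 50},
    {"name": "Western screenline", "observed": 900, "estimated": 780, "confidence": 100},
]

print(objective_never_increased(objective_history))
print(flag_discrepancies(comparison))
```

Listing the flagged items with the highest confidence first draws attention to the cases that most need explaining, since a poorly matched item with a high confidence level is the less expected outcome.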